Humans Learning

Reading data from other sources

true
R
data
readr
readxl
Author

Brian Calhoon

Published

December 6, 2024

Why am I here?       

Welcome to December! 2024 is coming to an end, and I cannot wait to see what 2025 has in store for us! As a reminder, each lesson is designed as a 5 - 10 minute virtual session conducted for EnCompass staff to expand their skills with data, and the means of learning is the R programming language. When I started learning R, I dove into Datacamp and played with the Gapminder dataset a lot. Then, one day I was asked to look at a dataset that I’d get by email. No problem, I thought. Then, the email arrived, and I looked at the .xlsx file on my screen. Hmmmm….How do I put this in my RStudio environment? I couldn’t print it and feed it to it…I couldn’t scan it….I could drop the file in….I was lost. I didn’t understand how to import a file. I walked down the hall and asked a friend. I don’t want you to experience this so consider this post your friend down the hall. Enjoy!

Learning objectives

For this session, the learning objective is to:

  • Import data from a .csv file and a .xlsx file

  • Write data to a .xlsx file

Wait, how does the data get into the Rstudio thingy?

Most data that is used in the workplace comes in the form of a .csv file or a Excel file. If you’re lucky, maybe you can import data directly from your company’s database or you download reliable data from the internet using an API. The reality is that you’ll probably get emails with Excel files attached so it’s important to know how to access these.

Install and load packages

If you already have the packages below installed, then you don’t need to install them again. I #’d the install.packages() lines so that the code doesn’t run for me. You can delete the #’s if you need to install the packages.

Then, activate the packages with the library() so they are active in your session.

# installing packages
#install.packages("tidyverse")
#install.packages("readxl")
#install.packages("writexl")
#install.packages("here")
#install.packages("writexl")

# another option for installing packages
#install.packages(c("tidyverse", "readxl", "writexl", "here", "writexl"))


library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
βœ” dplyr     1.1.4     βœ” readr     2.1.5
βœ” forcats   1.0.0     βœ” stringr   1.5.1
βœ” ggplot2   3.5.2     βœ” tibble    3.2.1
βœ” lubridate 1.9.4     βœ” tidyr     1.3.1
βœ” purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
βœ– dplyr::filter() masks stats::filter()
βœ– dplyr::lag()    masks stats::lag()
β„Ή Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(writexl)
library(here)
here() starts at C:/Users/brian/OneDrive/Documents/website

Read in files

Previously our data was included in the gapminder package. You can find lots of training data sets in packages, and this is really useful, but it’s not how you will get most of your data in the workplace. So, let’s get to it.

CSV file

# read the data in and assign it to a variable
df_csv <- read_csv(here::here("posts/read file/testdata2.csv"))
Rows: 20 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): location, type_visit, gender, v1, text_response
dbl (4): respondentid, v2, lat, long

β„Ή Use `spec()` to retrieve the full column specification for this data.
β„Ή Specify the column types or set `show_col_types = FALSE` to quiet this message.
#check the data

dplyr::glimpse(df_csv)
Rows: 20
Columns: 9
$ respondentid  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ location      <chr> "Lubelskie", "Lubelskie", "Lubelskie", "Lubelskie", "Lub…
$ type_visit    <chr> "in person", "in person", "in person", "in person", "in …
$ gender        <chr> "M", "F", "M", "F", "F", "F", "M", "M", "M", "F", "F", "…
$ v1            <chr> "yes", "no", "yes", "no", "yes", "no", "yes", "yes", "no…
$ v2            <dbl> 1, 5, 6, 2, 7, 1, 3, 3, 5, 7, 8, 9, 4, 2, 3, 1, 6, 4, 2,…
$ text_response <chr> "I liked the service.", "I did not like the service.", "…
$ lat           <dbl> 51.01707, 51.27283, 50.92471, 50.35257, 51.50020, 51.745…
$ long          <dbl> 23.12068, 23.53005, 23.57209, 23.03679, 23.24230, 21.981…

Success!

Excel file

# read the data in and assign it to a variable
df_xlsx <- read_xlsx(here::here("posts/read file/testdata2.xlsx"))

#check the data

dplyr::glimpse(df_xlsx)
Rows: 20
Columns: 9
$ respondentid  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ location      <chr> "Lubelskie", "Lubelskie", "Lubelskie", "Lubelskie", "Lub…
$ type_visit    <chr> "in person", "in person", "in person", "in person", "in …
$ gender        <chr> "M", "F", "M", "F", "F", "F", "M", "M", "M", "F", "F", "…
$ v1            <chr> "yes", "no", "yes", "no", "yes", "no", "yes", "yes", "no…
$ v2            <dbl> 1, 5, 6, 2, 7, 1, 3, 3, 5, 7, 8, 9, 4, 2, 3, 1, 6, 4, 2,…
$ text_response <chr> "I liked the service.", "I did not like the service.", "…
$ lat           <dbl> 51.01707, 51.27283, 50.92471, 50.35257, 51.50020, 51.745…
$ long          <dbl> 23.12068, 23.53005, 23.57209, 23.03679, 23.24230, 21.981…

Success!

Mutate the dataset

We can see that both datasets are identical with 20 rows (observations) across 9 columns. So, pick either one and we’ll add a variable called income using the mutate() function. This will add a column, and we will save it under as a new object that we will write back to an Excel file so we can email to our colleague.

# add a new variable to the dataset
df_add <- df_xlsx |> #pipe df to the mutate function
  dplyr::mutate(income = sample(1000:7500, 20))

#view first 6 rows

head(df_add)
# A tibble: 6 Γ— 10
  respondentid location  type_visit gender v1       v2 text_response   lat  long
         <dbl> <chr>     <chr>      <chr>  <chr> <dbl> <chr>         <dbl> <dbl>
1            1 Lubelskie in person  M      yes       1 I liked the …  51.0  23.1
2            2 Lubelskie in person  F      no        5 I did not li…  51.3  23.5
3            3 Lubelskie in person  M      yes       6 The service …  50.9  23.6
4            4 Lubelskie in person  F      no        2 Amazing        50.4  23.0
5            5 Lubelskie in person  F      yes       7 Poor           51.5  23.2
6            6 Lubelskie remote     F      no        1 Some great t…  51.7  22.0
# β„Ή 1 more variable: income <int>
#view last 6 rows
tail(df_add)
# A tibble: 6 Γ— 10
  respondentid location type_visit gender v1       v2 text_response    lat  long
         <dbl> <chr>    <chr>      <chr>  <chr> <dbl> <chr>          <dbl> <dbl>
1           15 Opolskie remote     M      no        3 I would have …  50.5  17.6
2           16 Opolskie remote     F      no        1 Yes, the team…  50.8  18.4
3           17 Opolskie remote     F      no        6 I didn't appr…  50.5  17.7
4           18 Opolskie telephone  F      yes       4 More support …  50.8  17.3
5           19 Opolskie telephone  M      no        2 I was told th…  51.0  18.5
6           20 Opolskie telephone  M      yes       2 I forgot my I…  50.7  17.3
# β„Ή 1 more variable: income <int>

We’re using the basic manipulation skills that we saw with the Gapminder dataset in the previous lessons.

Write it to Excel or CSV

Here all we need to do is specify the object and provide it a path where we want the file to be saved. You should write it to somewhere in your documents folder to avoid any issues with One Drive or Google Drive. N.B. with write_csv() path becomes file.

write_xlsx(x = df_add, path = here::here("posts/read file/test_out.xlsx"))

write_csv(x = df_add, file = here::here("posts/read file/test_out.csv"))

Now you can email this updated .xlsx/.csv file to your colleague. :)

Have fun!

Now it’s your turn practice! Below is a fully functioning code editor with starting code in place. Feel free to experiment with different grouping variables in the group_by() call or to adjust the summary statistic in summarize(). Then, have fun with the plot!